
Generic HPC Install Script #329

Merged: 71 commits merged into main from GH-191/longleaf-batch-submission on Oct 23, 2024

Conversation


@TimothyWillard TimothyWillard commented Oct 2, 2024

Describe your changes.

This adds a generic `hpc_install.sh` script that can reproducibly set up and install flepiMoP on both rockfish and longleaf. Roughly, the script does the following (a sketch of the flow follows the list below):

  1. Figures out HPC-specific variables and modules; currently only rockfish and longleaf are supported.
  2. Loads sensitive credentials.
  3. Sets up a FLEPI_PATH environment variable.
  4. Sets up or updates a conda environment.
  5. Ensures that the R and Python versions of arrow are compatible. These checks are loose and not definitive.
  6. Installs custom R packages.
  7. Sets up environment variables commonly used with flepiMoP.
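
A rough sketch of that flow, with hypothetical variable and module names where noted (the real script is more involved):

    # Hypothetical skeleton of the flow above, not the actual script
    set -e                                    # stop on the first failure
    CLUSTER="$1"                              # "rockfish" or "longleaf"
    if [[ $CLUSTER == "rockfish" ]]; then
        USERDIR="/scratch4/struelo1/flepimop-code/$USER/"
        module load gcc/9.3.0 git git-lfs slurm anaconda3/2022.05
    elif [[ $CLUSTER == "longleaf" ]]; then
        USERDIR="/users/${USER:0:1}/${USER:1:1}/$USER/"   # path layout inferred from the examples below
        module load git anaconda                          # module names assumed
    fi
    source "${USERDIR}slack_credentials.sh"               # step 2: sensitive credentials
    export FLEPI_PATH="${FLEPI_PATH:-${USERDIR}flepiMoP}" # step 3: respect a pre-set FLEPI_PATH
    conda env update --name flepimop-env \
        --file "$FLEPI_PATH/environment.yml"              # step 4: create/update the env (file name assumed)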

I'm going to add a separate PR for documentation, since that needs to be merged into gitbook-documentation. Usage on rockfish would look something like:

wget https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/GH-191/longleaf-batch-submission/build/hpc_install.sh
vim /scratch4/struelo1/flepimop-code/ext-twillard/slack_credentials.sh
chmod 600 /scratch4/struelo1/flepimop-code/ext-twillard/slack_credentials.sh
source hpc_install.sh rockfish

and on longleaf:

wget https://raw.githubusercontent.com/HopkinsIDD/flepiMoP/refs/heads/GH-191/longleaf-batch-submission/build/hpc_install.sh
vim /users/t/w/twillard/slack_credentials.sh
chmod 600 /users/t/w/twillard/slack_credentials.sh
source hpc_install.sh longleaf

In each case, replace the URL with the appropriate one. Then keeping your environment up to date is as easy as:

source /scratch4/struelo1/flepimop-code/ext-twillard/flepiMoP/build/hpc_install.sh rockfish

on rockfish or:

source /users/t/w/twillard/flepiMoP/build/hpc_install.sh longleaf

on longleaf.

A big open question is how best to install packages. Right now the script installs gempyor, flepiconfig, flepicommon, and inference from GitHub rather than locally, whereas I think installing locally would be preferred for dev reasons, at least in the meantime.

What does your pull request address? Tag relevant issues.

One of many steps required for GH-191. Should resolve GH-308.

Tag relevant team members.

@pearsonca, @shauntruelove, @MacdonaldJoshuaCaleb

Edit: fixed the wget URL to use the "raw" file instead of the pretty version.

Heavily inspired by the original `batch/slurm_init.sh` script. The init
script is a run-once script that takes care of installing dependencies
and setup, whereas prerun sets the env vars needed per run.
Initial version of the HPC install script, somewhat inspired by the
slurm init script.
* Changed how the R arrow version is formatted for readability.
* Changed the final output command to print diagnostic info correctly.
Added slurm's --partition flag to the `batch/inference_job_launcher.py`
script for usage on UNC's Longleaf cluster.
The longleaf-specific init/pre-run scripts are now superseded by the
generic `build/hpc_install.sh` script.
Removed the --partition flag for the slurm partition from the
inference job launcher script. This will be handled in a new
flepiscripts script.
@pearsonca pearsonca left a comment

Looks generally good, but a few questions to address.

elif [[ $1 == "rockfish" ]]; then
    # Setup general purpose user variables needed for RockFish
    USERDIR="/scratch4/struelo1/flepimop-code/$USER/"

Contributor

need to cd to USERDIR as well here?

Contributor

and if we do, several of the $USERDIRs below can/must be eliminated
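
A sketch of that suggestion (illustrative only):

    mkdir -p "$USERDIR"
    cd "$USERDIR"
    # later paths can then be relative instead of prefixed with $USERDIR, e.g.:
    git clone git@github.com:HopkinsIDD/flepiMoP.git   # rather than cloning into "$USERDIR/flepiMoP"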

Contributor

Could add creating some HPC-wide environment variables to the longleaf-setup repo. Does that make sense to pair with this?

Lastly ... bit weird that we're doing the install here in scratch. Why not in $HOME? I get doing projects on scratch.

Contributor

per in-person conversation:

  • need to check the preferred location for libraries on longleaf & rockfish
  • maybe refer to that as $LIBDIR (or ACCLIBDIR or some such)
  • might want to move that as a generic variable to be set on the HPC, and if so - move that to the longleaf-setup directions (which could itself stand to be scriptified) and make that setup a prerequisite to this? (one downside to that would be other people on other HPCs wanting to use / modify this script - future problem?)

Contributor Author

At least on the longleaf side it looks like /users is similar to $HOME, the documentation states "Think of it as a capacity expansion to your home directory." However, I think maybe the project directory should be moved to /work since that's high throughput and designed for active jobs. So my take is:

  • flepiMoP and flepimop-env stay in /users, especially for the conda env since that directory can get large and $HOME has some low and strict storage caps.
  • Move the project directory to /work since that'll actually need throughput for the job.

I still need to dig up the rockfish documentation. Longleaf docs: https://help.rc.unc.edu/getting-started-on-longleaf/#main-directory-spaces.

Contributor

Thanks for these scripts - the install all worked great for me on longleaf. 🎉

I'm also open to having these things anywhere, but as @jcblemai said, I think having everything (including the flepimop libraries) in /work or /scratch makes the most sense, including the flepiMoP folder itself. I understand that installing these in /users or /home would be ideal if flepiMoP were stable, but from a practical perspective I am operationally often changing things within flepimop and reinstalling things run-to-run, playing with my own different environments, jumping between branches, or jumping between different FLEPI_PATHs (not ideal, but practically this is just what we've had to do with concurrent runs and changes). So for convenience it would be good to just have everything in the same place, imo.

Separately, from my experience running stuff in the past, I was confused by having to link the specific location of the flepimop-env. I'm fine either way, I just don't think I follow why the change was made.

Contributor

@saraloo is there a general class of the things you're changing?

@TimothyWillard TimothyWillard commented Oct 18, 2024

I would put everything in /scratch/ on rockfish (as per the current doc)

This has been the case since 9ca12ed. I see that this was not the case for the $USERDIR variable; that's done now as well.

and in /work/ on longleaf, for both convenience and speed.

This is done now. I think I may be misinterpreting the docs (see https://help.rc.unc.edu/getting-started-on-longleaf/#main-directory-spaces) on the differences between /users and /work. @jcblemai, what are the practical differences between the two? My interpretation was that /work is meant for high-IO, short-term storage for active work, whereas /users is designed for longer-term, lower-IO (reads okay?) storage for libraries/codebases.

I am operationally often changing things within flepimop and reinstalling things run-to-run, playing with my own different environments, jumping between branches, or jumping between different FLEPI_PATH

@saraloo is this normal operational behavior? If so, the installation script needs to be much more accommodating of flexibility. For the different environments, do you mean switching between multiple conda envs? What makes each of these envs distinct? As for jumping branches, this script won't do anything to your flepiMoP clone; you can switch the branch yourself and then run this script again to update the conda env with the code from that branch ("install" is a misnomer; it really should be "install or update", so I'll change the script name and make sure this is clear when writing the documentation). Does that accommodate this use case? As for different $FLEPI_PATHs, this script checks whether that env var is set before doing anything, and if it is, it just uses the set value, so there should be no issue setting custom $FLEPI_PATHs. Have you tested this yet, and does it accommodate your use case?
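
For reference, the $FLEPI_PATH guard is roughly (a sketch, not the exact script code):

    # Respect a pre-set FLEPI_PATH; otherwise fall back to a default location
    if [[ -z "$FLEPI_PATH" ]]; then
        export FLEPI_PATH="${USERDIR}flepiMoP"
    fi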

Contributor

Responding to both simultaneously. No, I don't think this is normal behavior, so feel free to make a judgement call on your end. Just in the past, during larger periods of development (which inevitably coincide with operational demands), I was running two or three different diseases on significantly different gempyor and/or R inference setups from different conda environments (again, I don't think this will necessarily be standard, especially now that more people can run stuff). Just flagging that there will be circumstances where flexibility is preferable; I want to reduce the possibility of someone setting the wrong flepimop version they're working on, and reduce having to jump around to switch branches.
And sorry, I haven't tested the FLEPI_PATH bit yet, but that makes sense and I don't anticipate any issues with setting that.

Collaborator

As we move to the new workflow Carl described today, we will have custom branches for runs, so you can envision someone running Flu and RSV from the same account but using two different flepiMoP branches. I think, however, that this flexibility can be added later with the pre-run scripts that are mentioned below.

Sometimes, when running too many parallel runs, we can also hit filesystem locks on the packages, which is always annoying, but I would not worry about it too much.

Re: Sara's question: do we need to specify the location of the conda environment?

@jcblemai what are the practical differences between the two? My interpretation was that /work is meant for high-IO, short-term storage for active work, whereas /users is designed for longer-term, lower-IO (reads okay?) storage for libraries/codebases.

This is correct, but flepiMoP does not support writing to any folder other than the project one, so we work from /work.

build/hpc_install.sh (5 outdated review threads, resolved)
@jcblemai jcblemai commented Oct 2, 2024

That's really great, thank you. Ideally we would have a per-cluster configuration file that would populate some variables like:

  1. where project_dir is
  2. where the final files would go
  3. what pre-processing steps are needed before the base flepiMoP ones.

Then some cluster-agnostic script would run. Item 3 will also be used by the runner script.

We decided to use separate commands in the doc instead of a script so that errors are not silent and are reported as they occur. I would make sure the script exits on failure, maybe using set -e and set -x.
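
For example, a minimal illustration of those flags (not taken from the PR):

    set -e            # exit immediately when any command fails
    set -x            # echo each command before running it, so the failing step is visible
    module load git   # a failed module load now aborts the script instead of failing silently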

rockfish.yml

paths:
  - final_output_path: /scratch4/struelo1/flepimop-runs/
  - project_path: /scratch4/struelo1/flepimop-code/
  - secrets_path: $USER/flepi_secrets.sh

init_commands:
  - module purge
  - module load gcc/9.3.0
  - module load git
  - module load git-lfs
  - module load slurm
  - module load anaconda3/2022.05
  - conda activate flepimop-env

(perhaps it's better if the above is a bash file)
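
A bash equivalent might look like this (hypothetical variable names, directly translating the YAML sketch above):

    # rockfish.sh: per-cluster configuration as a sourceable bash file
    export FINAL_OUTPUT_PATH="/scratch4/struelo1/flepimop-runs/"
    export PROJECT_PATH="/scratch4/struelo1/flepimop-code/"
    export SECRETS_PATH="$USER/flepi_secrets.sh"   # mirrors the YAML sketch verbatim

    module purge
    module load gcc/9.3.0
    module load git
    module load git-lfs
    module load slurm
    module load anaconda3/2022.05
    conda activate flepimop-env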

build/hpc_install.sh (3 outdated review threads, resolved)
* Changed `flepiMoP` git clone to use ssh instead of http to allow for
  edits from HPC.
* Add `set -e` to error clearly on a command failure.
* Install `gempyor` from cloned `flepiMoP` repo directly, yet to do the
  same for R packages.
@jcblemai jcblemai commented Oct 3, 2024

Also, the above comments focus on what's missing, but it's really awesome that this runs on longleaf (it would have been very useful right now, except that I'm running emcee).

@TimothyWillard TimothyWillard force-pushed the GH-191/longleaf-batch-submission branch from 95035a4 to 3be416d on October 4, 2024 16:29
build/setup.R (2 outdated review threads, resolved)
@TimothyWillard TimothyWillard force-pushed the GH-191/longleaf-batch-submission branch 2 times, most recently from e9344cb to b1e6670 on October 7, 2024 14:10
Clean up error handling so the script exits a bit more nicely.
@TimothyWillard TimothyWillard dismissed stale reviews from jcblemai and pearsonca via aa1de4f October 21, 2024 21:24
@TimothyWillard (Contributor Author)

I like installing with pip -e because it means I can update gempyor very fast/switch branch and have the new thing ready.

This has been added now.
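
For reference, an editable install looks like this (the in-repo path is an assumption for illustration):

    # Editable mode: branch switches take effect without reinstalling
    pip install -e "$FLEPI_PATH/flepimop/gempyor_pkg"   # hypothetical path to the gempyor package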

Your first comment about how to use this PR should be added to the doc right ? and replace manual environment creation.

The first comment is a bit out of date now; usage has changed some. However, I plan to replace the current HPC install/update guides in the flepiMoP wiki in a separate PR into the documentation-gitbook branch, and to reference this PR from it.

Conda environment management has changed some from the discussion in #329 (comment), per a Slack discussion with @jcblemai and @saraloo. The script will now use a default environment in ~/.conda rather than an absolute path in the work directories of the clusters.
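
Roughly, the change looks like this (a sketch; the exact commands in the script may differ):

    # Before: environment referenced by an absolute path (hypothetical path shown)
    conda activate /work/users/t/w/twillard/flepimop-env
    # After: environment referenced by name; when the base anaconda module is
    # read-only, as is typical on HPC systems, conda keeps named envs under ~/.conda/envs
    conda create --name flepimop-env python=3.10   # the version pin is illustrative
    conda activate flepimop-env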

@pearsonca (Contributor)

Not sure where @jcblemai's comment went re branches wanted for different operational runs, but working trees seem like an option: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously - might be something to do via init script?
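
A worktree setup along those lines might look like this (illustrative branch and directory names):

    cd "$FLEPI_PATH"
    # Check out a second branch in a sibling directory, without a second full clone
    git worktree add ../flepiMoP-flu flu-operational-branch
    git worktree list   # show all checkouts attached to this repository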

Switch from using a conda environment specified by a path to a conda
environment specified by a name assumed to be in `~/.conda`.
@TimothyWillard TimothyWillard force-pushed the GH-191/longleaf-batch-submission branch from aa1de4f to 70ced08 on October 22, 2024 14:07
This option is not compatible with the `--editable` flag.
@TimothyWillard (Contributor Author)

Not sure where @jcblemai's comment went re branches wanted for different operational runs, but working trees seem like an option: https://stackoverflow.com/questions/2048470/git-working-on-two-branches-simultaneously - might be something to do via init script?

This is currently handled in batch/inference_job_launcher.py, which I think is where you want to handle branching for job submission, and which hasn't been touched by this PR yet. Do you want me to make changes to that in this PR as well? Maybe this is something better suited to https://github.com/ACCIDDA/flepiscripts/pull/1?

Looks good to me, provided a small batch run works with it.

After the recent round of edits I was able to submit one of the recent Flu configs to rockfish using an environment set up and initialized with the build/hpc_install_or_update.sh and batch/hpc_init.sh scripts. It was a small submission (2 jobs, 200 iterations), so the actual outputs aren't helpful, but it does demonstrate that the tools provided in this PR can set up an HPC environment suitable for batch submission.

@pearsonca (Contributor)

This is currently handled in batch/inference_job_launcher.py which is I think where you want to handle branching for job submission and hasn't been touched by this PR yet. Do you want me to make changes to that in this PR as well now? Maybe this is something better fit for ACCIDDA/flepiscripts#1?

Let's aim for "next up" on that - I'd like an integration of these scripts w/ the CLI to support flepimop update and flepimop init ... actions.

@TimothyWillard TimothyWillard commented Oct 22, 2024

Let's aim for "next up" on that - I'd like an integration of these scripts w/ the CLI to support flepimop update and flepimop init ... actions.

Sure, can you create an issue for that to move discussion of details there?

jcblemai previously approved these changes Oct 22, 2024
pearsonca previously approved these changes Oct 22, 2024
Required resolving conflicts in `inference`'s `DESCRIPTION` and
`install_cli.R`.
@TimothyWillard TimothyWillard dismissed stale reviews from pearsonca and jcblemai October 22, 2024 19:09

The merge-base changed after approval.

@TimothyWillard TimothyWillard merged commit 24d243c into main Oct 23, 2024
3 checks passed
@TimothyWillard TimothyWillard deleted the GH-191/longleaf-batch-submission branch October 23, 2024 13:23
Labels: batch (Relating to batch processing.), installation (Relating to installation / upgrade / migration.)